Identifying Errors in Russian Web Corpora
Abstract
The explosion of the Web leads to the production of large amounts of texts and inevitably influences their quality. Errors that tend to occur more often can distort the results, especially when the texts are used for scientific purposes, in language teaching or in language learning. Hence, there is a need to examine the existing corpora based on web texts and to clean up the data, which may contain such “noisy” fragments. In our study, we deal with the problem of errors and analyze the Aranea Russicum Maximum corpus. Among the errors, we name, above all, encoding errors, incorrect font types, as well as segments written in other languages. These phenomena result in errors in morphological analysis and lemmatization, in frequency distortion, and in the fact that lexical units cannot be found and therefore displayed to corpus users. The paper focuses on the errors, describes their types and outlines possible ways to eliminate them.
Similar resources
Orthographic Errors in Web Pages: Toward Cleaner Web Corpora
Since the Web by far represents the largest public repository of natural language texts, recent experiments, methods, and tools in the area of corpus linguistics often use the Web as a corpus. For applications where high accuracy is crucial, the problem has to be faced that a non-negligible number of orthographic and grammatical errors occur in Web documents. In this article we investigate the ...
Generating Learner-Like Morphological Errors in Russian
To speed up the process of categorizing learner errors and obtaining data for languages which lack error-annotated data, we describe a linguistically-informed method for generating learner-like morphological errors, focusing on Russian. We outline a procedure to select likely errors, relying on guiding stem and suffix combinations from a segmented lexicon to match particular error categories an...
Annotation errors detection in TTS corpora
We investigate the problem of automatic detection of annotation errors in single-speaker read-speech corpora used for text-to-speech (TTS) synthesis. Various word-level feature sets were used, and the performance of several detection methods based on support vector machines, extremely randomized trees, and k-nearest neighbors, as well as the performance of novelty and outlier detection, is evaluated. We sho...
Detecting Annotation Errors in Spoken Language Corpora
Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation (van Halteren, 2000; Eskin, 2000; Dickinson and Meurers, 2003a), more recently work has also started to address errors in syntactic and other...
Identifying Comparable Corpora Using LDA
Parallel corpora have applications in many areas of Natural Language Processing, but are very expensive to produce. Much information can be gained from comparable texts, and we present an algorithm which, given any bodies of text in multiple languages, uses existing named entity recognition software and topic detection algorithm to generate pairs of comparable texts without requiring a parallel...
Journal
Journal title: Jazykovedný časopis
Year: 2022
ISSN: 0021-5597, 1338-4287
DOI: https://doi.org/10.2478/jazcas-2022-0021